!jupyter nbconvert Project3_data_3.ipynb --to html
In this project, we are going to clean and process the following dataset.
We import NumPy and pandas to work with our data, Matplotlib to plot charts, Seaborn to make our charts prettier, and a few helpers (SciPy, Plotly, dython) for statistics and association measures.
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import matplotlib.pyplot as plt
import scipy.stats as ss
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
color = sns.color_palette()
sns.set(style="darkgrid")
from collections import Counter
from dython._private import convert, remove_incomplete_samples, replace_nan_with_value
from dython.nominal import associations
Let's load "HR_2016_Census_simple", which has been provided in the course datasets. We load the spreadsheet:
xls_file = pd.ExcelFile('HR_2016_Census_simple.xlsx')
xls_file
xls_file.sheet_names
rawdf = xls_file.parse('Data')
rawdf.head(10)
print('The dataset HR_2016_Census_simple has {} rows and {} features'.format(rawdf.shape[0],rawdf.shape[1]))
print('The feature names are : {} '.format(rawdf.columns.values))
print('Information about the feature types:')
rawdf.info()
num_features = rawdf.select_dtypes(include=['int64','float64'])
categorical_features = rawdf.select_dtypes(include='object')
print('The dataset has {} numerical feature(s). '.format(num_features.shape[1]))
print('The dataset has {} categorical feature(s). '.format(categorical_features.shape[1]))
rawdf.describe()
Quick check to see if there are any missing values at all:
print(rawdf.isnull().values.any())
The answer is yes, so let's see which columns have missing values:
# checking missing data
total = rawdf.isnull().sum().sort_values(ascending = False)
percent = 100*(rawdf.isnull().sum()/rawdf.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent, percent.cumsum()], axis=1, keys=['Total missing ', 'Percent', 'Cumulative Percent'])
missing_data.head(10)
Let's see which rows have missing data:
df_null = rawdf.loc[:, rawdf.isnull().any()]
df_null[df_null.isnull().any(axis=1)]
len(df_null[df_null.isnull().any(axis=1)])
df_null.isnull().sum()
rawdf.loc[121:126,'Geo Name']
The last rows in the dataset contain information for the territories, and low-income status does not apply to territories, so we can set those null values to zero. Also, all columns with missing values are proportions. Let's check the values in the related population columns:
rawdf.columns[rawdf.isnull().any()]
rawdf.loc[[39, 81, 94, 96, 97], ['LW_INC_ECON_FAM','LOW_INC_UNATTACHED','POP_PRIV_HHLDS_LW_INC','CHILD17_ECON_FAM_LOW_INC','IMM_2006_2016']]
It is obvious that the related populations were recorded as zero, so their corresponding rates should be set to zero too:
rawdf = rawdf.fillna(0)
print(rawdf.isnull().values.any())
Let's first take the first row, which is the data for Canada, and investigate whether there are duplicated columns: columns that contain the same information but are labeled differently.
df1 = rawdf.iloc[0]
df1_num = df1.drop(df1.index[1])
df1_num[df1_num.duplicated(keep=False)]
There are two columns POP_2016 and POP16_2A for the 2016 population, and two columns POP_25_54 and POP_25_54.1 for the population aged 25 to 54. We check the whole dataframe to compare all rows:
rawdf[['POP_2016','POP16_2A']], rawdf[['POP_25_54', 'POP_25_54.1']]
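More generally, identical columns (whatever their labels) can be found in one pass by transposing the frame and using `duplicated`; a small sketch on toy data:

```python
import pandas as pd

# Toy frame: 'b' duplicates 'a' under a different label; 'c' is distinct.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [4, 5, 6]})

# Transposing turns columns into rows, so duplicated() compares whole columns.
# keep=False flags every member of a duplicate group, not just the repeats.
dup_mask = toy.T.duplicated(keep=False)
dup_cols = dup_mask[dup_mask].index.tolist()
print(dup_cols)  # → ['a', 'b']
```

This avoids comparing suspected pairs by hand, though it only catches columns that match exactly.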
They are essentially the same. There are two more column pairs with the same values for Canada, but their names are not similar. Is the similarity in values for Canada a coincidence? Let's check them too:
rawdf[['RENTER_OVER30_RATE','HOU_AFF']]
rawdf[['SML_POP_CNTRE_RATE','POP_PRIV_HHLDS_LW_INC_RATE']]
The first two columns are essentially the same, but the last two are different. We drop the duplicated columns:
df = rawdf.drop(['POP16_2A','POP_25_54.1', 'HOU_AFF'], axis = 1)
df.head()
# comparing sizes of data frames
print("\nNumber of features of the old data frame is:", rawdf.shape[1], "\nNumber of features of the new data frame is:",
df.shape[1] )
The dataset HR_2016_Census_simple contains information about Canada, the provinces and territories, and health regions, based on the 2016 Census short-form and long-form questionnaires. The source tables are: Statistics Canada Table 17-10-0122-01, Census indicator profile, based on the 2016 Census short-form questionnaire, Canada, provinces and territories, and health regions (2017 boundaries); and Table 17-10-0123-01, Census indicator profile, based on the 2016 Census long-form questionnaire, Canada, provinces and territories, and health regions (2017 boundaries).
To build a data dictionary for the dataset, we use the source data for the short form and long form, loading them into dfshort and dflong and combining them into one data frame, dfshortlong. We keep only the columns "Census profile" and VALUE.
dfshort = pd.read_csv('1710012201_databaseLoadingData.csv')
dfshort = dfshort[['Census profile','VALUE']]
dflong = pd.read_csv('1710012301_databaseLoadingData.csv')
dflong = dflong[['Census indicator profile','VALUE']]
dflong.columns = ['Census profile','VALUE']
dfshortlong = pd.concat([dfshort, dflong], sort=True)
dfshortlong
Now we make a data frame called data_dictionary with three columns: 'Feature', 'Type', 'Description'. "Feature" and "Type" refer to the column names and types in HR_2016_Census_simple. We will use the "Census profile" column of dfshortlong as the description for each feature in HR_2016_Census_simple. Note that dfshortlong has only 91 rows, while HR_2016_Census_simple has 105 columns; since the first two (Geocode and Geo Name) are described manually, we expect at least 103 - 91 = 12 null values in the description column, even if all data in both datasets match.
dfshortlong = dfshortlong[['Census profile','VALUE']]
df1 = rawdf.iloc[0]
data_dictionary = pd.DataFrame(columns=['Feature','Type', 'Description'])
data_dictionary.loc[0] = ['Geocode', 'numeric', 'Geographic code: 1-4 digits']
data_dictionary.loc[1] = ['Geo Name', 'string', 'Geographical names: Canada, provinces and territories and health regions']
for j in range(2, len(df1)):
    k = 0
    for i in range(len(dfshortlong)):
        if df1.values[j] == dfshortlong['VALUE'].values[i]:
            k = 1
            data_dictionary.loc[j] = [df1.index[j], 'numeric', dfshortlong['Census profile'].values[i]]
    if k == 0:
        data_dictionary.loc[j] = [df1.index[j], 'numeric', np.nan]
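The nested loop above scans all of dfshortlong once per feature. The same lookup can be expressed with a value-to-description map built once; a sketch with made-up values (the names `feature_values` and `lookup` are stand-ins for df1 and dfshortlong):

```python
import pandas as pd

# Toy stand-ins: feature_values maps feature name -> value for Canada,
# lookup plays the role of dfshortlong (description + value).
feature_values = pd.Series({'POP': 100.0, 'RATE': 7.5, 'MYSTERY': 9.9})
lookup = pd.DataFrame({'Census profile': ['Population', 'Some rate'],
                       'VALUE': [100.0, 7.5]})

# Value -> description map; later rows win, like the loop's overwrite.
# Exact-value matching on floats is fragile in general; fine for a sketch.
value_to_desc = dict(zip(lookup['VALUE'], lookup['Census profile']))

dd = pd.DataFrame({'Feature': feature_values.index,
                   'Type': 'numeric',
                   'Description': feature_values.map(value_to_desc).values})
print(dd)  # 'MYSTERY' gets NaN, like the k == 0 branch
```

Unmatched values come back as NaN, which plays the role of the `k == 0` branch.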
data_dictionary
We check to see if we have any null values:
print(data_dictionary.isnull().values.any())
The answer is yes, so we collect all rows with an empty description into null_data:
null_data = data_dictionary[data_dictionary.isnull().any(axis=1)]
null_data, len(null_data)
There are only 12 features with an empty description, which matches our estimate. However, the columns LONE_FEMALE_RATE and LONE_MALE_RATE are related to LONE_FEMALE and LONE_MALE, whose descriptions have been filled:
data_dictionary[data_dictionary['Feature'] == 'LONE_FEMALE_RATE'],data_dictionary[data_dictionary['Feature'] == 'LONE_MALE']
Looking more closely at the short form and long form, we find that when a rate exists for a Census profile, it usually comes right after the population for that profile. Therefore we can check whether those rates exist in the dfshortlong that we downloaded from the Statcan website:
dfshortlong.iloc[dfshortlong[dfshortlong['VALUE'] == df1['LONE_FEMALE']].index[0]:dfshortlong[dfshortlong['VALUE'] == df1['LONE_FEMALE']].index[0]+4]
It is obvious that the two rates exist in dfshortlong. Now let's have a look at the values that we have for these two features in our dataset HR_2016_Census_simple for Canada:
df1[['LONE_FEMALE_RATE','LONE_MALE_RATE']]
Evidently, there is a noticeable discrepancy between the two datasets for these two features. The dataset we downloaded from the Statcan website indicates that female lone-parent families make up 78.3% and male lone-parent families 21.7%. Since these two proportions sum to 100%, they cannot be proportions of all census families; rather, they are proportions of census lone-parent families. Our dataset, though, indicates that female lone-parent families are 12.8% of total census families and male lone-parent families 3.6%, which sounds reasonable. We therefore add descriptions for them to data_dictionary:
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'LONE_FEMALE_RATE'].index[0]] = ['LONE_FEMALE_RATE', 'numeric', 'Female lone-parent families, proportion of census families']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'LONE_MALE_RATE'].index[0]] = ['LONE_MALE_RATE', 'numeric', 'Male lone-parent families, proportion of census families']
null_data = data_dictionary[data_dictionary.isnull().any(axis=1)]
null_data, len(null_data)
Now we have 10 features with an empty description. Looking again at the duplicated values for Canada below, there are a couple more values that are equal but belong to different features: SML_POP_CNTRE_RATE vs. POP_PRIV_HHLDS_LW_INC_RATE, and RENTER_OVER30_RATE vs. HOU_AFF. We need to make sure we have the right descriptions for them:
df1_num = df1.drop(df1.index[1])
df1_num[df1_num.duplicated(keep=False)]
#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"
data_dictionary[data_dictionary['Feature'] == 'SML_POP_CNTRE_RATE']
data_dictionary[data_dictionary['Feature'] == 'POP_PRIV_HHLDS_LW_INC_RATE']
data_dictionary[data_dictionary['Feature'] == 'RENTER_OVER30_RATE']
data_dictionary[data_dictionary['Feature'] == 'HOU_AFF']
We notice that the descriptions for SML_POP_CNTRE_RATE and HOU_AFF are not correct. We can adjust them, and it is also not difficult to find the descriptions for the remaining features:
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'SML_POP_CNTRE_RATE'].index[0]] = ['SML_POP_CNTRE_RATE', 'numeric', 'Small population centre population, proportion of total population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'HOU_AFF'].index[0]] = ['HOU_AFF', 'numeric', 'Households spending 30% or more of household income on shelter, proportion of total shelter-cost households']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'POP11'].index[0]] = ['POP11', 'numeric', '2011 population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'GROWTH'].index[0]] = ['GROWTH', 'numeric', '2011 to 2016 population growth (%)']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'POP_0_19'].index[0]] = ['POP_0_19', 'numeric', 'Population aged 0 to 19']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'PCT_0_19'].index[0]] = ['PCT_0_19', 'numeric', 'Population aged 0 to 19, proportion of total population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'POP_65_PLUS'].index[0]] = ['POP_65_PLUS', 'numeric', 'Population aged 65 and over']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'PCT_65_PLUS'].index[0]] = ['PCT_65_PLUS', 'numeric', 'Population aged 65 and over, proportion of total population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'MALE'].index[0]] = ['MALE', 'numeric', 'Male population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'FEMALE'].index[0]] = ['FEMALE', 'numeric', 'Female population']
data_dictionary.iloc[data_dictionary[data_dictionary['Feature'] == 'MF_RATE'].index[0]] = ['MF_RATE', 'numeric', 'Male to female population rate']
null_data = data_dictionary[data_dictionary.isnull().any(axis=1)]
null_data, len(null_data)
There is one more feature left, but it is again the 2016 population. Let's check that column in the original dataset and compare it with POP_2016:
rawdf[['POP_2016','POP16']]
There is a minor difference; we ignore it and drop the column POP16 from df and data_dictionary:
data_dictionary = data_dictionary.drop([data_dictionary[data_dictionary['Feature'] == 'POP16'].index[0]], axis = 0)
df = df.drop(['POP16'], axis=1)
df.head()
Finally, our data dictionary is complete and describes all the features in the data frame:
data_dictionary
In making the data dictionary we came across the columns MALE, FEMALE and MF_RATE, which do not exist in the short form or long form we downloaded from the website. Let's have a closer look:
rawdf[['MALE','FEMALE','MF_RATE']]
It seems the female and male populations are almost equal in all regions and their ratio is almost 1; let's count:
df['MF_RATE'].value_counts()
The ratio is 1 or very close to 1 in most regions, so we can drop these three columns from df:
df = df.drop(['MALE','FEMALE','MF_RATE'], axis=1)
print("\nNumber of features of the old data frame 'rawdf' is:", rawdf.shape[1], "\nNumber of features of the new data frame 'df' is:",
df.shape[1] )
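The visual check that MF_RATE is nearly constant can be automated; a sketch that flags near-constant columns on toy data (column names and the tolerance are illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: MF_RATE hovers around 1, GROWTH varies widely.
toy = pd.DataFrame({'MF_RATE': [0.99, 1.0, 1.01, 1.0],
                    'GROWTH':  [1.2, -0.5, 5.0, 11.0]})

def near_constant(s, tol=0.05):
    # True if every value lies within tol of the column median.
    return bool(np.all(np.abs(s - s.median()) <= tol))

flagged = [c for c in toy.columns if near_constant(toy[c])]
print(flagged)  # → ['MF_RATE']
```

Columns flagged this way carry almost no information and are candidates for dropping, like MF_RATE above.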
The main goal: understanding the data by exploring it and trying to spot unlikely and irregular patterns.
What are the features?
So far we have dropped duplicated and some other columns from the dataset df, yet we still have too many features. In this section the goal is to select the "right" features for future analysis. A practical approach is to look at the pairwise correlation between features. Correlation measures how close two variables are to having a linear relationship. When the correlation between two features is high, they are nearly linearly dependent and hence carry almost the same information, so we can drop one of them. Although we haven't specified a target variable yet, it is still helpful to investigate pairwise correlations.
It is easy to distinguish two types of features in this dataset: those that represent rates, and those that are populations or other quantities, which we call non-rate. A quick first division is to pick the columns containing RATE or PCT in their name. We also include the column GROWTH, which is essentially a rate, and exclude Geo Name, a string, from the non-rate features; we still have Geocode to connect back to Geo Name.
# collecting rate features in df_rate
df.rename(columns={'GROWTH':'GROWTH_RATE','PCT_0_19':'POP_0_19_RATE', 'PCT_65_PLUS':'POP_65_RATE'}, inplace=True)
df_rate = df.loc[:,df.columns.str.contains('RATE')]
# collecting non-rate features in df_nrate, excluding Geo Name which is a string, we still have Geocode to connect with Geo Name
df_nrate = df.loc[:,~df.columns.str.contains('RATE')]
df_nrate = df_nrate.drop(['Geo Name'], axis = 1)
print("\nNumber of features in rate category is:", df_rate.shape[1], "\nNumber of features in non-rate category is:",
df_nrate.shape[1] )
First let's have a look at the pairwise correlation between features in non-rate category:
corr_nrate = df_nrate.corr()
sns.heatmap(corr_nrate)
plt.title('Correlation between different features in Non-rate category')
In the heat map above, very bright cells indicate highly correlated features. We compare the correlations and remove one of any two features with a correlation of at least 0.9:
columns = np.full((corr_nrate.shape[0],), True, dtype=bool)
for i in range(corr_nrate.shape[0]):
    for j in range(i + 1, corr_nrate.shape[0]):
        if corr_nrate.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = df_nrate.columns[columns]
df_nrate = df_nrate[selected_columns]
df_nrate
df_nrate.shape[1]
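The boolean-mask loop can be written more compactly with an upper-triangle view of the correlation matrix, which visits each pair exactly once; a sketch on toy data (same keep-the-first rule, threshold 0.9):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [2, 4, 6, 8],     # perfectly correlated with 'a'
                    'c': [1, -1, 2, 0]})   # uncorrelated with 'a'

corr = toy.corr()
# Keep only entries strictly above the diagonal so each pair is seen once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop any column that correlates >= 0.9 with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] >= 0.9).any()]
reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))  # → ['a', 'c']
```

Here 'b' is dropped because its correlation with the earlier column 'a' reaches the threshold, exactly as in the loop above.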
Out of 61 features in the non-rate category, 50 are highly correlated (>= 0.9) with another feature, so we can reduce this category to 11 features. Most of the surviving features are proportions or statistics such as averages or medians; only one is a population. In other words, all population columns are redundant to a single one, the 2016 population. Now let's check the correlation between features in the rate category:
corr_rate = df_rate.corr()
sns.heatmap(corr_rate)
plt.title('Correlation between different features in rate category')
From the heat map it seems there isn't that much redundancy in this category. Again, we remove one of any two features with a correlation of at least 0.9:
columns = np.full((corr_rate.shape[0],), True, dtype=bool)
for i in range(corr_rate.shape[0]):
    for j in range(i + 1, corr_rate.shape[0]):
        if corr_rate.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = df_rate.columns[columns]
df_rate = df_rate[selected_columns]
df_rate
df_rate.shape[1]
Out of 36 features in the rate category, 8 are highly correlated (>= 0.9) with another feature, so we can reduce this category to 28. In total we have reduced the number of features to 39:
df39 = pd.concat([df_nrate, df_rate], axis=1)
df39.head()
df39.describe()
to_multiply = [col for col in df39 if max(df39[col]) > 100]
to_multiply
Looking at the summary statistics of the selected features, almost all values lie in the range (0, 100) except for POP_2016, POP_DENSE, AVE_PERS_INC, AVE_PERS_INC_FEMALE, AVE_DWELL and MEDIAN_HHLD_INC. We can rescale those columns to fit the same range. We also add back the geographical name for future analysis:
df39['POP_2016_in_million'] = df39['POP_2016'].values/1000000
df39['POP_DENSE_in_100'] = df39['POP_DENSE'].values/100
df39['AVE_PERS_INC_in_1000'] = df39['AVE_PERS_INC'].values/1000
df39['AVE_PERS_INC_FEMALE_in_1000'] = df39['AVE_PERS_INC_FEMALE'].values/1000
df39['MEDIAN_HHLD_INC_in_1000'] = df39['MEDIAN_HHLD_INC'].values/1000
df39['AVE_DWELL_in_100000'] = df39['AVE_DWELL'].values/100000
df39 = df39.drop(['POP_2016','POP_DENSE','AVE_PERS_INC','AVE_PERS_INC_FEMALE','AVE_DWELL','MEDIAN_HHLD_INC'], axis = 1)
df39 = pd.concat([df39, rawdf['Geo Name']], axis=1)
df39.head()
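The fixed divisors above were chosen by hand; the power-of-ten factor can also be derived from each column's maximum automatically. A sketch on made-up data (the helper name and renaming scheme are illustrative, not the exact ones used above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'POP': [35_000_000, 500_000],
                    'RATE': [12.5, 80.0]})

def shrink_wide_columns(df, limit=100):
    out = df.copy()
    for col in df.columns:
        if df[col].max() > limit:
            # Smallest power of ten bringing the maximum to at most `limit`.
            factor = 10 ** int(np.ceil(np.log10(df[col].max() / limit)))
            out[f'{col}_in_{factor}'] = df[col] / factor
            out = out.drop(columns=col)
    return out

scaled = shrink_wide_columns(toy)
print(scaled.columns.tolist())  # → ['RATE', 'POP_in_1000000']
```

Recording the factor in the column name, as the notebook does with `_in_million` and friends, keeps the transformation reversible.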
corr39 = df39.corr()
plt.figure(figsize=(27,27))
sns.heatmap(corr39,
xticklabels=corr39.columns.values,
yticklabels=corr39.columns.values, annot=True, cmap='cubehelix', square=True)
plt.title('Correlation between 39 reduced features')
For further analysis of the data we work with the 39 selected features. We also pick out the data for Canada, the provinces and the territories, store it in df14, and do some analysis at that level. We also set the index of df39 to be the Geocode.
df39.set_index("Geocode", inplace=True)
Geocode_14 = [1, 10, 11, 12, 13, 24, 35, 46, 47, 48, 59, 60, 61, 62]
Geoname_14 = ['Canada', 'Newfoundland and Labrador', 'Prince Edward Island', 'Nova Scotia', 'New Brunswick', 'Quebec', 'Ontario', 'Manitoba', 'Saskatchewan', 'Alberta', 'British Columbia', 'Yukon', 'Northwest Territories', 'Nunavut']
Geo14 = pd.DataFrame({'Geocode': Geocode_14, 'Geoname': Geoname_14})
df14 = df39.loc[Geocode_14]
df14
Below are histograms of the 39 selected features, giving a quick look at the overall distribution of each feature and revealing possible outliers.
import matplotlib
params = {'axes.titlesize':'40',
'xtick.labelsize':'24',
'ytick.labelsize':'24'}
matplotlib.rcParams.update(params)
df39.hist(figsize=(80,60))
plt.tight_layout()
From the above histograms we can easily detect some of the outliers; for example:
We expect outliers in 2016 population: about 120 of the 127 regions in the list have populations below 5 million, while a few (about 7) have populations of 5-15 million. Note that 35 million is the population of all of Canada. The same is true for population density: in most regions it is below 500 per square kilometer, while a few have densities of 500-1000 (or even 3500-5000) per square kilometer!
In most regions, the lone-parent family rate is 10-20% of the total census family population, while for a few the rate is as high as 50%.
High school graduation rates in most regions are above 70%; however, a few regions are as low as 20-50%.
We also have some outliers in average dwelling value, as high as $1,200,000 to $1,400,000.
In the following we investigate the distributions of some selected features and flag some regions as potential outliers.
data_dictionary.set_index("Feature", inplace=True)
data_dictionary = data_dictionary.rename(index={'GROWTH':'GROWTH_RATE','POP_2016':'POP_2016_in_million','POP_DENSE':'POP_DENSE_in_100','AVE_PERS_INC':'AVE_PERS_INC_in_1000' ,'AVE_PERS_INC_FEMALE':'AVE_PERS_INC_FEMALE_in_1000','AVE_DWELL':'AVE_DWELL_in_100000','MEDIAN_HHLD_INC':'MEDIAN_HHLD_INC_in_1000'})
col = 'POP_DENSE_in_100'
plt.figure(figsize = (8, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'POP_DENSE_in_100'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
The first 10 regions above are potential outliers among regions; Vancouver, Toronto and Montreal particularly stand out with population densities of 4000-5000 per square kilometer. Among provinces, PEI stands out with a population density of 25 per square kilometer.
col = 'POP_2016_in_million'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('2016 Population in million', fontsize=25)
plt.show()
Below we can see the sorted distribution for Lone parent families for all regions and for provinces. Keewatin with 50% stands out among regions and Nunavut with close to 30% among provinces and territories.
col = 'LONE_RATE'
plt.figure(figsize = (8, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'LONE_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
In the distribution chart below, the four regions at the bottom, with high school graduation rates below 50%, stand out.
col = 'HSG_RATE'
plt.figure(figsize = (8, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'HSG_PSG_BACH_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'PSG_ABOVE_BACH_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'IMM_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
Nunavut with long-term unemployment rate over 16% is an outlier among provinces/territories.
col = 'UE_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'EMP_RATE_25_54'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'LRG_POP_CNTR_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
PEI is again an outlier for medium population centers among provinces and territories.
col = 'MED_POP_CNTR_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'SML_POP_CNTRE_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'RRL_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
Nunavut and Alberta with growth rate over 11% and New Brunswick with negative growth rate are potential outliers.
col = 'GROWTH_RATE'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col, 'Description'], fontsize=25)
plt.show()
col = 'AVE_PERS_INC_in_1000'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Average total income (in 1000) in 2015 of population 15 years and over', fontsize=25)
plt.show()
Vancouver, North Shore and Richmond, with average dwelling values at or over a million dollars, definitely stand out among the regions, while British Columbia is an outlier among provinces and territories.
col = 'AVE_DWELL_in_100000'
plt.figure(figsize = (18,40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Average value of dwelling in 100,000', fontsize=25)
plt.show()
col = 'AVE_DWELL_in_100000'
plt.figure(figsize = (18, 8))
g = df14.sort_values(by=[col])
plt.barh(df14["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title('Average value of dwelling in 100,000', fontsize=25)
plt.show()
col = 'LRG_POP_CNTR_RATE'
plt.figure(figsize = (10, 40))
g = df39.sort_values(by=[col])
plt.barh(df39["Geo Name"][g.index], g[col].values)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.title(data_dictionary.loc[col,'Description'], fontsize=25)
plt.show()
In this section we perform bivariate analysis for a selection of features.
Let's pick the 7 features with correlation higher than 0.44 with GROWTH_RATE and look at their pairwise scatter plots.
#Correlation with output variable
corr_target = abs(corr39["GROWTH_RATE"])
a = corr_target.sort_values(ascending=False)[1:12]
print('The top features most correlated with growth rate are : \n \n {} '.format(a))
dfbp = df39.drop(axis = 0, index = 1)
dfbp = dfbp.drop('Geo Name', axis = 1)
dfGR = dfbp[['GROWTH_RATE','GOVT_TRNSFR_INC','POP_65_RATE','IMM_2006_2016_RATE_TOTPOP','PSG_BACH_RATE','AVE_PERS_INC_in_1000','LRG_POP_CNTR_RATE','AVE_DWELL_in_100000']]
sns.pairplot(dfGR)
The growth rate increases almost linearly with average personal income and decreases with government transfer income. This makes sense: government transfer income decreases as average personal income increases, since transfers are aimed at supporting the lower-income portion of the population. However, government transfer income seems to decrease exponentially, not linearly, with increasing average income.
The immigration rate for 2006-2016 and the rate of post-secondary graduates with a bachelor's degree are also linearly related, with positive correlation. We can conclude that most immigrants to Canada hold a university degree; this was expected, as Canadian universities admit many international graduate students. That could also explain why the growth rate increases with the bachelor's-degree graduate rate, as well as the positive correlation between the graduate rate and large urban population centers; large urban population centers host one or more universities.
Average personal income is higher in large urban population centers and increases with the graduate rate, which is expected: jobs that require postgraduate degrees pay better. Average income decreases with the proportion of the population over 65, which is also expected, as 65 is the retirement age. The average value of dwellings increases with average income, as expected: an expensive house is affordable for a higher-income person.
The average value of dwellings is higher in large urban population centers and lower where the population over 65 is larger; retired people move to smaller, quieter and (probably) less expensive towns.
We repeat the previous section, but this time we pick features with high correlation with the low income rate.
#Correlation with output variable
corr_target = abs(corr39['LW_INC_ECON_FAM_RATE'])
a = corr_target.sort_values(ascending=False)[1:16]
print('The top features most correlated with the low income rate are : \n \n {} '.format(a))
dfLIn = dfbp[['LW_INC_ECON_FAM_RATE','POP_DENSE_in_100','SML_POP_CNTRE_RATE','AVE_DWELL_in_100000','RRL_RATE','LRG_POP_CNTR_RATE']]
sns.pairplot(dfLIn)
The low income rate decreases with the rural and small-population-center rates and increases with the large-urban-population rate: low income families are less likely to live in rural areas or small population centers and most likely to live in large urban population centers. We can further infer that most low income families live in dwellings with lower average values.
We also see that where population density is high, the low income family rate is high too, which again supports the conclusion that most lower-income families live in large urban centers, where population density is high.
Population density is near 0 (or not available) for most regions in the dataset. This was evident in the bar plot of population density in the previous section, where Vancouver and Toronto were the densest regions in the list. From the population density vs. average dwelling value plot, it is clear that Vancouver, which has the highest density, also has the highest average dwelling value.
In rural areas and small population centers, average dwelling values are mostly lower.
The source of the dataset is Statistics Canada, and therefore reliable. In preparing the data dictionary we compared the values of all columns for Canada with the data from the Statistics Canada website, and they all match except for the female and male lone-parent family rates. We spotted a mistake on the website in the definition of female and male lone-parent families: those columns do not match our dataset, as the website's rates are proportions of lone-parent families, while we believe our data are proportions of census families. The data in these columns for Canada seem reasonable.
We addressed some potential outliers while doing the univariate/bivariate analysis. We can also detect outliers using boxplots as follows:
col1 = [col for col in dfbp if max(dfbp[col]) <= 40]
dfbp1 = dfbp[col1]
dfbp2 = dfbp.drop(col1, axis = 1)
dfbp2.shape[1]
dfbp1.boxplot(figsize=(15,10),grid=False, vert=False, fontsize=10)
plt.tight_layout()
dfbp2.boxplot(figsize=(15,10),grid=False, vert=False, fontsize=10)
plt.tight_layout()
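The boxplot whiskers follow the usual 1.5×IQR convention; the same rule can be used to list the flagged values programmatically (a sketch on toy data; the column name is illustrative):

```python
import pandas as pd

toy = pd.Series([10, 12, 11, 13, 12, 11, 50], name='HSG_RATE')

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
# Values beyond 1.5*IQR from the quartiles are flagged, as in a boxplot.
outliers = toy[(toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)]
print(outliers.tolist())  # → [50]
```

Applied column by column, this turns the visual boxplot inspection into an explicit list of regions to investigate.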
We only applied minor transformations to features with large values, such as population, incomes and dwelling values, to keep them in the same range as the other columns. It wasn't crucial, though.